Introduction

Aim:
The aim of this project is to apply classification techniques to predict novel transcription factor target genes, focusing on Sox2 and Nanog during embryonic stem cell (ESC) differentiation, using multi-omics data.

Background:
Transcriptional regulation is a fundamental process in all living organisms, driven by transcription factors that control mRNA expression. These transcriptional networks are crucial in development, lineage specification, and cell fate decisions during early embryonic development (Theunissen and Jaenisch, 2017). Recent advances in omics technologies allow for the profiling of genome-wide transcriptional and epigenetic events, providing a deeper understanding of these networks.

In this project, we utilize high-temporal-resolution multi-omics data of ESC differentiation (Yang et al., 2019) to predict novel target genes of Sox2 and Nanog—two key transcription factors involved in maintaining pluripotency and guiding cell differentiation.

Dataset Overview:
- Transcriptome: Time-course mRNA expression profiles during ESC differentiation.
- Proteome: Time-course protein expression profiles during ESC differentiation.
- Epigenome: Time-course ESC differentiation profiles of six epigenomic marks (five histone modifications plus RNA polymerase II).

We will develop and validate a classification model to predict novel transcription factor target genes, focusing on Sox2 and Nanog, using the provided multi-omics datasets.

EDA

Load Required Libraries and Data

We start by loading the necessary R packages and the dataset Final_Project_ESC.RData, which contains the transcriptome, proteome, and epigenome data, along with a subset of known Sox2/Nanog target genes.

# Load necessary packages and data
load("Final_Project_ESC.RData", verbose = TRUE)
## Loading objects:
##   Transcriptome
##   Proteome
##   H3K4me3
##   H3K27me3
##   PolII
##   H3K4me1
##   H3K27ac
##   H3K9me2
##   cMyc_target_genes
##   cMyc_target_genes_subset
##   OSN_target_genes
##   OSN_target_genes_subset
suppressPackageStartupMessages({
    library(e1071)
    library(ggplot2)
    library(ROCR)
    library(calibrate)
    library(dplyr)
    library(tibble)
    library(reshape2)
    library(kernlab)
    library(caret)
    library(randomForest)
    library(adabag)
    library(gbm)
    library(xgboost)
    library(pROC)
    library(doParallel)
})
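Before the per-dataset exploration, a quick sanity check can confirm that each loaded omics matrix has genes as rows and time points as columns. This is a sketch using the object names listed by `load()` above:

```r
# Collect the loaded omics matrices and report their dimensions side by side
omics <- list(Transcriptome = Transcriptome, Proteome = Proteome,
              H3K4me3 = H3K4me3, H3K27me3 = H3K27me3, H3K27ac = H3K27ac,
              H3K4me1 = H3K4me1, H3K9me2 = H3K9me2, PolII = PolII)
sapply(omics, dim)   # row 1 = number of genes, row 2 = number of time points
```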


Describe and explore the Data set details

Before beginning data analysis, it is important to understand and investigate the data. The goal of this report is to predict novel transcription factor target genes from multi-omics data. For each of our datasets, let's look at the structure of the data and perform PCA as a means of identifying trends in the dataset.

Below is the temporal expression of the Transcriptome, Proteome, and Epigenome datasets at time points 0, 1, 6, 12, 24, 36, 48, and 72 hours.

Transcriptome

head(Transcriptome)
##                 0hr         1hr          6hr        12hr        24hr
## GNAI3  8.881784e-16  0.29131114  0.618947976  0.48337689  0.71864514
## CDC45  0.000000e+00  0.25062323  0.199000752  0.38800303  0.47163966
## H19    0.000000e+00  0.38475910  0.471216601  0.55341685  1.87279375
## SCML2  8.881784e-16  0.64099671  0.917100751  0.73622935  0.74708761
## NARF  -8.881784e-16 -0.40736349 -1.186942510 -0.69986049 -0.15754727
## CAV2   5.551115e-17  0.01608049  0.002154149  0.07278802  0.06737809
##              36hr        48hr        72hr
## GNAI3  0.90833508  0.63408090  0.81560693
## CDC45  0.20338260 -0.03095025 -0.55159826
## H19    2.77535428  4.17558703  4.34079023
## SCML2  0.71202112 -0.20856228 -1.12588475
## NARF  -0.73012533 -0.11611985 -0.01534405
## CAV2   0.08131677  0.46840687  1.32719756
dim(Transcriptome)
## [1] 19788     8
colnames(Transcriptome)
## [1] "0hr"  "1hr"  "6hr"  "12hr" "24hr" "36hr" "48hr" "72hr"
# PCA analysis on the correlation matrix of the transcriptome data
cor.mat <- cor(Transcriptome)
pca.mat <- prcomp(cor.mat)

# Plot the PCA
grp <- rownames(pca.mat$x)
grp.col <- rainbow(nrow(pca.mat$x))
names(grp.col) <- rownames(pca.mat$x)

# Generate PCA plot
plot(pca.mat$x[,1], pca.mat$x[,2], col=grp.col[grp], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.mat)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.mat)$importance[2,2]*100,1), "% variance)"))

# Add sample labels to the plot
calibrate::textxy(pca.mat$x[,1], pca.mat$x[,2], labs=grp, cex=0.5)

Proteome

cor.proteome <- cor(Proteome)
pca.proteome <- prcomp(cor.proteome)
summary(pca.proteome)$importance
##                              PC1       PC2       PC3       PC4        PC5
## Standard deviation     0.5233315 0.2915203 0.1624012 0.1053089 0.07579296
## Proportion of Variance 0.6763300 0.2098700 0.0651300 0.0273900 0.01419000
## Cumulative Proportion  0.6763300 0.8862000 0.9513300 0.9787100 0.99290000
##                               PC6          PC7
## Standard deviation     0.05362587 2.715731e-18
## Proportion of Variance 0.00710000 0.000000e+00
## Cumulative Proportion  1.00000000 1.000000e+00
# Using the previous correlation matrix and PCA results
cor.proteome <- cor(Proteome)
pca.proteome <- prcomp(cor.proteome)

# Get group labels and colors
grp <- rownames(pca.proteome$x)  # Set groups according to your data
grp.col <- rainbow(nrow(pca.proteome$x))
names(grp.col) <- rownames(pca.proteome$x)

# Plot the PCA
plot(pca.proteome$x[,1], pca.proteome$x[,2], col=grp.col[grp], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.proteome)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.proteome)$importance[2,2]*100,1), "% variance)"))

h3k4me3

# PCA analysis on the correlation matrix of the H3K4me3 data
cor.h3k4me3 <- cor(H3K4me3)
pca.h3k4me3 <- prcomp(cor.h3k4me3)

# Get group labels and colors
grp <- rownames(pca.h3k4me3$x)
grp.col <- rainbow(nrow(pca.h3k4me3$x))
names(grp.col) <- rownames(pca.h3k4me3$x)

# Generate PCA plot for H3K4me3
plot(pca.h3k4me3$x[,1], pca.h3k4me3$x[,2], col=grp.col[grp], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.h3k4me3)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.h3k4me3)$importance[2,2]*100,1), "% variance)"))

# Add sample labels to the plot
calibrate::textxy(pca.h3k4me3$x[,1], pca.h3k4me3$x[,2], labs=grp, cex=0.5)

H3K27me3

# PCA analysis for H3K27me3 data
cor.h3k27me3 <- cor(H3K27me3)
pca.h3k27me3 <- prcomp(cor.h3k27me3)

# Update group labels and colors for H3K27me3
grp.h3k27me3 <- rownames(pca.h3k27me3$x)
grp.col.h3k27me3 <- rainbow(nrow(pca.h3k27me3$x))
names(grp.col.h3k27me3) <- rownames(pca.h3k27me3$x)

# Generate PCA plot for H3K27me3
plot(pca.h3k27me3$x[,1], pca.h3k27me3$x[,2], col=grp.col.h3k27me3[grp.h3k27me3], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.h3k27me3)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.h3k27me3)$importance[2,2]*100,1), "% variance)"))

# Correctly label the samples for H3K27me3
calibrate::textxy(pca.h3k27me3$x[,1], pca.h3k27me3$x[,2], labs=grp.h3k27me3, cex=0.5)

H3K27ac

# PCA analysis for H3K27ac data
cor.h3k27ac <- cor(H3K27ac)
pca.h3k27ac <- prcomp(cor.h3k27ac)

# Update group labels and colors for H3K27ac
grp.h3k27ac <- rownames(pca.h3k27ac$x)
grp.col.h3k27ac <- rainbow(nrow(pca.h3k27ac$x))
names(grp.col.h3k27ac) <- rownames(pca.h3k27ac$x)

# Generate PCA plot for H3K27ac
plot(pca.h3k27ac$x[,1], pca.h3k27ac$x[,2], col=grp.col.h3k27ac[grp.h3k27ac], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.h3k27ac)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.h3k27ac)$importance[2,2]*100,1), "% variance)"))

# Correctly label the samples for H3K27ac
calibrate::textxy(pca.h3k27ac$x[,1], pca.h3k27ac$x[,2], labs=grp.h3k27ac, cex=0.5)

H3K4me1

# PCA analysis for H3K4me1 data
cor.h3k4me1 <- cor(H3K4me1)
pca.h3k4me1 <- prcomp(cor.h3k4me1)

# Define the group labels and colors specifically for H3K4me1 data
grp.h3k4me1 <- rownames(pca.h3k4me1$x)
grp.col.h3k4me1 <- rainbow(nrow(pca.h3k4me1$x))
names(grp.col.h3k4me1) <- rownames(pca.h3k4me1$x)

# Generate PCA plot for H3K4me1
plot(pca.h3k4me1$x[,1], pca.h3k4me1$x[,2], col=grp.col.h3k4me1[grp.h3k4me1], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.h3k4me1)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.h3k4me1)$importance[2,2]*100,1), "% variance)"))

# Correctly label the samples for H3K4me1
calibrate::textxy(pca.h3k4me1$x[,1], pca.h3k4me1$x[,2], labs=grp.h3k4me1, cex=0.5)

H3K9me2

# PCA analysis for H3K9me2 data
cor.h3k9me2 <- cor(H3K9me2)
pca.h3k9me2 <- prcomp(cor.h3k9me2)

# Define the group labels and colors specifically for H3K9me2 data
grp.h3k9me2 <- rownames(pca.h3k9me2$x)
grp.col.h3k9me2 <- rainbow(nrow(pca.h3k9me2$x))
names(grp.col.h3k9me2) <- rownames(pca.h3k9me2$x)

# Generate PCA plot for H3K9me2
plot(pca.h3k9me2$x[,1], pca.h3k9me2$x[,2], col=grp.col.h3k9me2[grp.h3k9me2], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.h3k9me2)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.h3k9me2)$importance[2,2]*100,1), "% variance)"))

# Correctly label the samples for H3K9me2
calibrate::textxy(pca.h3k9me2$x[,1], pca.h3k9me2$x[,2], labs=grp.h3k9me2, cex=0.5)

PolII

# PCA analysis for PolII data
cor.polii <- cor(PolII)
pca.polii <- prcomp(cor.polii)

# Define the group labels and colors specifically for PolII data
grp.polii <- rownames(pca.polii$x)
grp.col.polii <- rainbow(nrow(pca.polii$x))
names(grp.col.polii) <- rownames(pca.polii$x)

# Generate PCA plot for PolII
plot(pca.polii$x[,1], pca.polii$x[,2], col=grp.col.polii[grp.polii], pch=19, cex=2,
     xlab=paste0("PC1 (", round(summary(pca.polii)$importance[2,1]*100,1), "% variance)"),
     ylab=paste0("PC2 (", round(summary(pca.polii)$importance[2,2]*100,1), "% variance)"))

# Correctly label the samples for PolII
calibrate::textxy(pca.polii$x[,1], pca.polii$x[,2], labs=grp.polii, cex=0.5)
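The per-dataset PCA blocks above all follow the same pattern and could be factored into a single helper. A sketch (the function name `plot_cor_pca` is hypothetical; it reproduces the plotting steps used above):

```r
# Hypothetical helper consolidating the repeated correlation-PCA plotting pattern
plot_cor_pca <- function(mat, label.cex = 0.5) {
  pca <- prcomp(cor(mat))                          # PCA on the sample correlation matrix
  grp <- rownames(pca$x)
  grp.col <- setNames(rainbow(length(grp)), grp)   # one color per sample/time point
  pct <- round(summary(pca)$importance[2, 1:2] * 100, 1)
  plot(pca$x[, 1], pca$x[, 2], col = grp.col[grp], pch = 19, cex = 2,
       xlab = paste0("PC1 (", pct[1], "% variance)"),
       ylab = paste0("PC2 (", pct[2], "% variance)"))
  calibrate::textxy(pca$x[, 1], pca$x[, 2], labs = grp, cex = label.cex)
  invisible(pca)
}
# e.g. plot_cor_pca(H3K4me3); plot_cor_pca(PolII)
```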

Classification

>>>>>>> 93663ae (Add theme and format)

Filter and Combine Datasets

To properly model and predict novel transcription factor target genes, we join the eight datasets together and perform our calculations on the combined dataset.

To ensure consistency across datasets, we filter each dataset to include only the common genes present in all omics layers. We then combine these filtered datasets into a single data frame for further analysis.
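The chained `intersect()` calls below can be written more compactly with `Reduce()`; a sketch using the same object names:

```r
# Equivalent, more compact way to find the genes common to all eight datasets
all_omics <- list(Transcriptome, Proteome, H3K4me3, H3K27me3,
                  H3K27ac, H3K4me1, H3K9me2, PolII)
genes <- Reduce(intersect, lapply(all_omics, rownames))
```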

# Ensure all data has the same set of genes
genes <- intersect(rownames(Transcriptome), rownames(Proteome))
genes <- intersect(genes, rownames(H3K4me3))
genes <- intersect(genes, rownames(H3K27me3))
genes <- intersect(genes, rownames(H3K27ac))
genes <- intersect(genes, rownames(H3K4me1))
genes <- intersect(genes, rownames(H3K9me2))
genes <- intersect(genes, rownames(PolII))
# Filter each dataset for the common genes
Transcriptome_filter <- Transcriptome[genes, ]
Proteome_filter <- Proteome[genes, ]
H3K4me3_filter <- H3K4me3[genes, ]
H3K27me3_filter <- H3K27me3[genes, ]
H3K27ac_filter <- H3K27ac[genes, ]
H3K4me1_filter <- H3K4me1[genes, ]
H3K9me2_filter <- H3K9me2[genes, ]
PolII_filter <- PolII[genes, ]
# Confirm that all datasets share the same genes
identical(rownames(Transcriptome_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(Proteome_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K27ac_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K4me1_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K9me2_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(PolII_filter), rownames(H3K4me3_filter))
## [1] TRUE
# Rename columns to avoid conflicts
colnames(Transcriptome_filter) <- paste("T_", colnames(Transcriptome_filter), sep = "")
colnames(Proteome_filter) <- paste("P_", colnames(Proteome_filter), sep = "")
colnames(H3K4me3_filter) <- paste("H3K4me3_", colnames(H3K4me3_filter), sep = "")
colnames(H3K27me3_filter) <- paste("H3K27me3_", colnames(H3K27me3_filter), sep = "")
colnames(H3K27ac_filter) <- paste("H3K27ac_", colnames(H3K27ac_filter), sep = "")
colnames(H3K4me1_filter) <- paste("H3K4me1_", colnames(H3K4me1_filter), sep = "")
colnames(H3K9me2_filter) <- paste("H3K9me2_", colnames(H3K9me2_filter), sep = "")
colnames(PolII_filter) <- paste("PolII_", colnames(PolII_filter), sep = "")

# Combine the datasets
combined_data <- cbind(
  Transcriptome_filter,
  Proteome_filter,
  H3K4me3_filter,
  H3K27me3_filter,
  H3K27ac_filter,
  H3K4me1_filter,
  H3K9me2_filter,
  PolII_filter
)

# Add the labels
label <- ifelse(genes %in% OSN_target_genes_subset, "OSN", "Other")
combined_data <- data.frame(combined_data)
combined_data$label <- factor(label)
# Number of genes which are known to be targets for Sox2 and Nanog
length(OSN_target_genes_subset)
## [1] 100

We have 100 known OSN target genes, and as seen below, 95 of them are present in our combined filtered dataset and labeled as Sox2/Nanog targets.
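The overlap between the known targets and the common-gene set can be checked directly (a sketch; `genes` is the vector of common genes computed above):

```r
# How many of the 100 known OSN targets survive the common-gene filter?
sum(OSN_target_genes_subset %in% genes)
# expected to match the "OSN" count in the label table
```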

# Check the initial label distribution
print(table(combined_data$label))
## 
##   OSN Other 
##    95  8085

Data Splitting and Balancing


The dataset is split into training (90%) and testing (10%) sets. The label column is reassigned to the test set to ensure that it is included correctly.

# Split the dataset into training (90%) and testing (10%) sets
set.seed(123)
train_index <- createDataPartition(combined_data$label, p = 0.9, list = FALSE)

train_data <- combined_data[train_index, ]
test_data <- combined_data[-train_index, ]

# Reassign the label column to test_data
test_data$label <- combined_data[-train_index, "label"]

# Check the distribution of labels in the training and test sets
print("Training set label distribution:")
## [1] "Training set label distribution:"
print(table(train_data$label))
## 
##   OSN Other 
##    86  7277
print("Test set label distribution:")
## [1] "Test set label distribution:"
print(table(test_data$label))
## 
##   OSN Other 
##     9   808

Balancing the Training Data

Since the dataset is imbalanced, downsampling is used to create a balanced training dataset. This ensures that both classes (OSN and Other) are equally represented, improving the robustness of the classification model.
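`caret::downSample` keeps every minority-class row and randomly samples an equal number of majority-class rows. A self-contained toy sketch of its behavior (the data here are made up for illustration):

```r
library(caret)
set.seed(1)
# Toy imbalanced data: 4 "OSN" rows vs 16 "Other" rows
toy_x <- data.frame(feature = rnorm(20))
toy_y <- factor(c(rep("OSN", 4), rep("Other", 16)))
toy_balanced <- downSample(x = toy_x, y = toy_y, yname = "label")
table(toy_balanced$label)  # both classes downsampled to 4 rows each
```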

# Balance the training data using downsampling
set.seed(123)
downsampled_train_data <- downSample(x = train_data[, -ncol(train_data)],
                                     y = train_data$label,
                                     yname = "label")

# Check the balanced label distribution
print("Balanced training set label distribution:")
## [1] "Balanced training set label distribution:"
print(table(downsampled_train_data$label))
## 
##   OSN Other 
##    86    86
# Final check of training dataset dimensions
print(dim(downsampled_train_data))
## [1] 172  64

Model Training

We train two machine learning models, an SVM (with the radial basis function kernel) and a Random Forest, using the balanced training dataset.

# Set a consistent seed
set.seed(123)
# Train an SVM model on the downsampled training data with radial basis function kernel
svm_model <- svm(label ~ ., data = downsampled_train_data, kernel = "radial", probability = TRUE)

# Train a Random Forest model on the downsampled training data
set.seed(123)
rf_model <- randomForest(label ~ ., data = downsampled_train_data, ntree = 1000)

Bagging w/ Bagged Trees

bagged_trees <- train(
  label ~ .,
  data = downsampled_train_data,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 5),
  tuneLength = 5
)
print(bagged_trees)
## Bagged CART 
## 
## 172 samples
##  63 predictor
##   2 classes: 'OSN', 'Other' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 137, 138, 138, 137, 138 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7447059  0.4888148

GBM w/ Hyperparameter Tuning via xgboost (using parallel processing)

For hyperparameter tuning we define the grid of parameters below and train an XGBoost model on the downsampled data. We use cross-validation to provide a reliable model that is not overfit, and register a parallel backend so the model can be trained in parallel.

tune_grid_xgb <- expand.grid(
  nrounds = c(100, 200),
  max_depth = c(3, 6),
  eta = c(0.1, 0.3),
  gamma = c(0, 0.1),
  colsample_bytree = c(0.5, 1),
  min_child_weight = c(1, 10),
  subsample = c(0.5, 1)
)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

train_control <- trainControl(
  method = "cv",
  number = 3,
  savePredictions = "final",
  verboseIter = TRUE,
  allowParallel = TRUE
)
# Train the tuned XGBoost model with the grid and control defined above
xgb_model_tuned <- train(
  label ~ .,
  data = downsampled_train_data,
  method = "xgbTree",
  trControl = train_control,
  tuneGrid = tune_grid_xgb
)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set
print(xgb_model_tuned)
## eXtreme Gradient Boosting 
## 
## 172 samples
##  63 predictor
##   2 classes: 'OSN', 'Other' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 114, 116, 114 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  gamma  colsample_bytree  min_child_weight  subsample  nrounds
##   0.1  3          0.0    0.5                1                0.5        100    
##   0.1  3          0.0    0.5                1                0.5        200    
##   0.1  3          0.0    0.5                1                1.0        100    
##   0.1  3          0.0    0.5                1                1.0        200    
##   0.1  3          0.0    0.5               10                0.5        100    
##   0.1  3          0.0    0.5               10                0.5        200    
##   0.1  3          0.0    0.5               10                1.0        100    
##   0.1  3          0.0    0.5               10                1.0        200    
##   0.1  3          0.0    1.0                1                0.5        100    
##   0.1  3          0.0    1.0                1                0.5        200    
##   0.1  3          0.0    1.0                1                1.0        100    
##   0.1  3          0.0    1.0                1                1.0        200    
##   0.1  3          0.0    1.0               10                0.5        100    
##   0.1  3          0.0    1.0               10                0.5        200    
##   0.1  3          0.0    1.0               10                1.0        100    
##   0.1  3          0.0    1.0               10                1.0        200    
##   0.1  3          0.1    0.5                1                0.5        100    
##   0.1  3          0.1    0.5                1                0.5        200    
##   0.1  3          0.1    0.5                1                1.0        100    
##   0.1  3          0.1    0.5                1                1.0        200    
##   0.1  3          0.1    0.5               10                0.5        100    
##   0.1  3          0.1    0.5               10                0.5        200    
##   0.1  3          0.1    0.5               10                1.0        100    
##   0.1  3          0.1    0.5               10                1.0        200    
##   0.1  3          0.1    1.0                1                0.5        100    
##   0.1  3          0.1    1.0                1                0.5        200    
##   0.1  3          0.1    1.0                1                1.0        100    
##   0.1  3          0.1    1.0                1                1.0        200    
##   0.1  3          0.1    1.0               10                0.5        100    
##   0.1  3          0.1    1.0               10                0.5        200    
##   0.1  3          0.1    1.0               10                1.0        100    
##   0.1  3          0.1    1.0               10                1.0        200    
##   0.1  6          0.0    0.5                1                0.5        100    
##   0.1  6          0.0    0.5                1                0.5        200    
##   0.1  6          0.0    0.5                1                1.0        100    
##   0.1  6          0.0    0.5                1                1.0        200    
##   0.1  6          0.0    0.5               10                0.5        100    
##   0.1  6          0.0    0.5               10                0.5        200    
##   0.1  6          0.0    0.5               10                1.0        100    
##   0.1  6          0.0    0.5               10                1.0        200    
##   0.1  6          0.0    1.0                1                0.5        100    
##   0.1  6          0.0    1.0                1                0.5        200    
##   0.1  6          0.0    1.0                1                1.0        100    
##   0.1  6          0.0    1.0                1                1.0        200    
##   0.1  6          0.0    1.0               10                0.5        100    
##   0.1  6          0.0    1.0               10                0.5        200    
##   0.1  6          0.0    1.0               10                1.0        100    
##   0.1  6          0.0    1.0               10                1.0        200    
##   0.1  6          0.1    0.5                1                0.5        100    
##   0.1  6          0.1    0.5                1                0.5        200    
##   0.1  6          0.1    0.5                1                1.0        100    
##   0.1  6          0.1    0.5                1                1.0        200    
##   0.1  6          0.1    0.5               10                0.5        100    
##   0.1  6          0.1    0.5               10                0.5        200    
##   0.1  6          0.1    0.5               10                1.0        100    
##   0.1  6          0.1    0.5               10                1.0        200    
##   0.1  6          0.1    1.0                1                0.5        100    
##   0.1  6          0.1    1.0                1                0.5        200    
##   0.1  6          0.1    1.0                1                1.0        100    
##   0.1  6          0.1    1.0                1                1.0        200    
##   0.1  6          0.1    1.0               10                0.5        100    
##   0.1  6          0.1    1.0               10                0.5        200    
##   0.1  6          0.1    1.0               10                1.0        100    
##   0.1  6          0.1    1.0               10                1.0        200    
##   0.3  3          0.0    0.5                1                0.5        100    
##   0.3  3          0.0    0.5                1                0.5        200    
##   0.3  3          0.0    0.5                1                1.0        100    
##   0.3  3          0.0    0.5                1                1.0        200    
##   0.3  3          0.0    0.5               10                0.5        100    
##   0.3  3          0.0    0.5               10                0.5        200    
##   0.3  3          0.0    0.5               10                1.0        100    
##   0.3  3          0.0    0.5               10                1.0        200    
##   0.3  3          0.0    1.0                1                0.5        100    
##   0.3  3          0.0    1.0                1                0.5        200    
##   0.3  3          0.0    1.0                1                1.0        100    
##   0.3  3          0.0    1.0                1                1.0        200    
##   0.3  3          0.0    1.0               10                0.5        100    
##   0.3  3          0.0    1.0               10                0.5        200    
##   0.3  3          0.0    1.0               10                1.0        100    
##   0.3  3          0.0    1.0               10                1.0        200    
##   0.3  3          0.1    0.5                1                0.5        100    
##   0.3  3          0.1    0.5                1                0.5        200    
##   0.3  3          0.1    0.5                1                1.0        100    
##   0.3  3          0.1    0.5                1                1.0        200    
##   0.3  3          0.1    0.5               10                0.5        100    
##   0.3  3          0.1    0.5               10                0.5        200    
##   0.3  3          0.1    0.5               10                1.0        100    
##   0.3  3          0.1    0.5               10                1.0        200    
##   0.3  3          0.1    1.0                1                0.5        100    
##   0.3  3          0.1    1.0                1                0.5        200    
##   0.3  3          0.1    1.0                1                1.0        100    
##   0.3  3          0.1    1.0                1                1.0        200    
##   0.3  3          0.1    1.0               10                0.5        100    
##   0.3  3          0.1    1.0               10                0.5        200    
##   0.3  3          0.1    1.0               10                1.0        100    
##   0.3  3          0.1    1.0               10                1.0        200    
##   0.3  6          0.0    0.5                1                0.5        100    
##   0.3  6          0.0    0.5                1                0.5        200    
##   0.3  6          0.0    0.5                1                1.0        100    
##   0.3  6          0.0    0.5                1                1.0        200    
##   0.3  6          0.0    0.5               10                0.5        100    
##   0.3  6          0.0    0.5               10                0.5        200    
##   0.3  6          0.0    0.5               10                1.0        100    
##   0.3  6          0.0    0.5               10                1.0        200    
##   0.3  6          0.0    1.0                1                0.5        100    
##   0.3  6          0.0    1.0                1                0.5        200    
##   0.3  6          0.0    1.0                1                1.0        100    
##   0.3  6          0.0    1.0                1                1.0        200    
##   0.3  6          0.0    1.0               10                0.5        100    
##   0.3  6          0.0    1.0               10                0.5        200    
##   0.3  6          0.0    1.0               10                1.0        100    
##   0.3  6          0.0    1.0               10                1.0        200    
##   0.3  6          0.1    0.5                1                0.5        100    
##   0.3  6          0.1    0.5                1                0.5        200    
##   0.3  6          0.1    0.5                1                1.0        100    
##   0.3  6          0.1    0.5                1                1.0        200    
##   0.3  6          0.1    0.5               10                0.5        100    
##   0.3  6          0.1    0.5               10                0.5        200    
##   0.3  6          0.1    0.5               10                1.0        100    
##   0.3  6          0.1    0.5               10                1.0        200    
##   0.3  6          0.1    1.0                1                0.5        100    
##   0.3  6          0.1    1.0                1                0.5        200    
##   0.3  6          0.1    1.0                1                1.0        100    
##   0.3  6          0.1    1.0                1                1.0        200    
##   0.3  6          0.1    1.0               10                0.5        100    
##   0.3  6          0.1    1.0               10                0.5        200    
##   0.3  6          0.1    1.0               10                1.0        100    
##   0.3  6          0.1    1.0               10                1.0        200    
##   Accuracy   Kappa    
##   0.7497947  0.4995895
##   0.7323481  0.4646962
##   0.7619048  0.5238095
##   0.7502053  0.5004105
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6687192  0.3374384
##   0.6687192  0.3374384
##   0.7561576  0.5123153
##   0.7210591  0.4421182
##   0.7504105  0.5008210
##   0.7329639  0.4659278
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6744663  0.3489327
##   0.6802135  0.3604269
##   0.7151067  0.4302135
##   0.7261905  0.4523810
##   0.7272167  0.4544335
##   0.7327586  0.4655172
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6744663  0.3489327
##   0.6744663  0.3489327
##   0.7676519  0.5353038
##   0.7559524  0.5119048
##   0.7448686  0.4897373
##   0.7563629  0.5127258
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6744663  0.3489327
##   0.6802135  0.3604269
##   0.7440476  0.4880952
##   0.7500000  0.5000000
##   0.7504105  0.5008210
##   0.7561576  0.5123153
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6629721  0.3259442
##   0.6629721  0.3259442
##   0.7329639  0.4659278
##   0.7327586  0.4655172
##   0.7331691  0.4663383
##   0.7272167  0.4544335
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6744663  0.3489327
##   0.6802135  0.3604269
##   0.7674466  0.5348933
##   0.7557471  0.5114943
##   0.7621100  0.5242200
##   0.7621100  0.5242200
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6863711  0.3727422
##   0.6863711  0.3727422
##   0.7385057  0.4770115
##   0.7502053  0.5004105
##   0.7274220  0.4548440
##   0.7333744  0.4667488
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6744663  0.3489327
##   0.6802135  0.3604269
##   0.7212644  0.4425287
##   0.7157225  0.4314450
##   0.7091544  0.4183087
##   0.7032020  0.4064039
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6629721  0.3259442
##   0.6629721  0.3259442
##   0.7385057  0.4770115
##   0.7323481  0.4646962
##   0.7448686  0.4897373
##   0.7274220  0.4548440
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6800082  0.3600164
##   0.6800082  0.3600164
##   0.7155172  0.4310345
##   0.7270115  0.4540230
##   0.7387110  0.4774220
##   0.7387110  0.4774220
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.7036125  0.4072250
##   0.7036125  0.4072250
##   0.7034072  0.4068144
##   0.7153120  0.4306240
##   0.7448686  0.4897373
##   0.7448686  0.4897373
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6800082  0.3600164
##   0.6800082  0.3600164
##   0.7376847  0.4753695
##   0.7378900  0.4757800
##   0.7446634  0.4893268
##   0.7389163  0.4778325
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6921182  0.3842365
##   0.6921182  0.3842365
##   0.7446634  0.4893268
##   0.7329639  0.4659278
##   0.7097701  0.4195402
##   0.7153120  0.4306240
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6800082  0.3600164
##   0.6800082  0.3600164
##   0.7325534  0.4651067
##   0.7385057  0.4770115
##   0.7619048  0.5238095
##   0.7619048  0.5238095
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6683087  0.3366174
##   0.6683087  0.3366174
##   0.7442529  0.4885057
##   0.7383005  0.4766010
##   0.7040230  0.4080460
##   0.7040230  0.4080460
##   0.5000000  0.0000000
##   0.5000000  0.0000000
##   0.6800082  0.3600164
##   0.6800082  0.3600164
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3, eta
##  = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1 and subsample
##  = 0.5.
# Printing the best model's details
print(xgb_model_tuned$bestTune)
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 25     100         3 0.1   0.1                1                1       0.5
# Plotting model performance
plot(xgb_model_tuned)



After hyperparameter tuning, the best-performing XGBoost configuration (selected on accuracy) used nrounds = 100, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1 and subsample = 0.5, with a cross-validated accuracy of about 0.77.

Model Evaluation

Predictions and Confusion Matrix

We use the trained models to make predictions on the test set and evaluate their performance using confusion matrices.

# Ensure column names in the test set match the training data
colnames(test_data) <- colnames(downsampled_train_data)[1:(ncol(downsampled_train_data) - 1)]  # Exclude the label column

# Check if the label column is present and correctly populated
if ("label" %in% colnames(combined_data) && length(combined_data[-train_index, "label"]) == nrow(test_data)) {
    # Assign the label to test_data
    test_data$label <- combined_data[-train_index, "label"]
} else {
    stop("The label vector is empty or has a different length than expected. Check the data preparation steps.")
}

# Ensure the label is a factor with the correct levels
test_data$label <- factor(test_data$label, levels = c("OSN", "Other"))

# SVM Predictions on the test set
svm_test_pred <- predict(svm_model, newdata = test_data[, -ncol(test_data)], probability = TRUE)
svm_test_prob <- attr(svm_test_pred, "probabilities")[, "OSN"]

# Random Forest Predictions on the test set
rf_test_pred <- predict(rf_model, newdata = test_data[, -ncol(test_data)], type = "prob")[, "OSN"]

# SVM Confusion Matrix on the test set
svm_test_conf_matrix <- table(Predicted = ifelse(svm_test_prob > 0.5, "OSN", "Other"), Actual = test_data$label)
print("SVM Test Confusion Matrix:")
## [1] "SVM Test Confusion Matrix:"
print(svm_test_conf_matrix)
##          Actual
## Predicted OSN Other
##     OSN     7   194
##     Other   2   614
# Random Forest Confusion Matrix on the test set
rf_test_conf_matrix <- table(Predicted = ifelse(rf_test_pred > 0.5, "OSN", "Other"), Actual = test_data$label)
print("Random Forest Test Confusion Matrix:")
## [1] "Random Forest Test Confusion Matrix:"
print(rf_test_conf_matrix)
##          Actual
## Predicted OSN Other
##     OSN     8   204
##     Other   1   604
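The raw counts above can be condensed into per-class metrics. As a minimal sketch using caret's confusionMatrix (caret is already loaded for train), built from the prediction objects defined above:

```r
# Sketch: class-level metrics for the random forest predictions,
# treating "OSN" as the positive class.
rf_test_class <- factor(ifelse(rf_test_pred > 0.5, "OSN", "Other"),
                        levels = c("OSN", "Other"))
rf_cm <- confusionMatrix(data = rf_test_class,
                         reference = test_data$label,
                         positive = "OSN")
print(rf_cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")])
```

Because the OSN class is rare in the test set, sensitivity and balanced accuracy are more informative here than raw accuracy.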

ROC Curve and AUC

To further assess model performance, we plot the ROC curves and calculate the AUC for both the SVM and Random Forest models.

# Evaluate the models on the test set (AUC and ROC)
svm_test_roc <- roc(test_data$label, svm_test_prob)
## Setting levels: control = OSN, case = Other
## Setting direction: controls > cases
plot(svm_test_roc, main = "SVM ROC Curve")

print(paste("Final SVM Test AUC:", auc(svm_test_roc)))
## [1] "Final SVM Test AUC: 0.766914191419142"
rf_test_roc <- roc(test_data$label, rf_test_pred)
## Setting levels: control = OSN, case = Other
## Setting direction: controls > cases
plot(rf_test_roc, main = "Random Forest ROC Curve")
print(paste("Final Random Forest Test AUC:", auc(rf_test_roc)))
## [1] "Final Random Forest Test AUC: 0.8373899889989"

Future Directions:

Model Improvement: Further hyperparameter tuning using automated methods like grid search or random search could refine the model’s accuracy. Additionally, exploring ensemble methods that combine predictions from several models might yield better results.

Data Expansion: Incorporating additional omics layers, such as metabolomics or additional transcription factor binding profiles, could help to improve the model’s predictive power and generalisability.

Integration with Clinical Data: Linking omics profiles with clinical outcomes could also be explored to improve the translational impact of the research.
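The ensemble idea mentioned under Model Improvement can be sketched directly from the objects above: averaging the SVM and random forest class probabilities gives a simple soft-voting classifier. This is an illustration of the suggestion, not part of the original analysis:

```r
# Soft-voting ensemble: average the two models' "OSN" probabilities
# (svm_test_prob and rf_test_pred from the evaluation section).
ensemble_prob <- (svm_test_prob + rf_test_pred) / 2
ensemble_pred <- ifelse(ensemble_prob > 0.5, "OSN", "Other")
ensemble_roc <- roc(test_data$label, ensemble_prob)
print(paste("Ensemble Test AUC:", auc(ensemble_roc)))
```

Weighted averages (or stacking a meta-learner on the two probability vectors) are natural next steps if the simple average does not improve on the random forest alone.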
